[fix](function) fix tokenize function incorrect result when first argument is const#62699
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
…ument is const
After unpack_if_const, the inner data column of a ColumnConst has only 1 row.
The _do_tokenize and _do_tokenize_none functions iterate based on the source
column's row count, so when the first argument is a constant (e.g.,
SELECT tokenize('hello world', 'parser=english') FROM table), only 1 output
row was produced instead of input_rows_count rows. Fix by wrapping the result
in ColumnConst when the source column was const.
### What problem does this PR solve? Problem Summary: clang-format-16 reported a formatting deviation in `function_tokenize.cpp` where the `ColumnConst::create(...)` argument fit on the same line as `result`. Join the two arguments on a single line so the file passes `build-support/check-format.sh`. ### Release note None ### Check List (For Author) - Test: No need to test (formatting-only change) - Behavior changed: No - Does this need documentation: No
20601b5 to
05e014f
Compare
|
run buildall |
TPC-H: Total hot run time: 30960 ms |
TPC-DS: Total hot run time: 171976 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found one blocking review issue: this BE function behavior fix does not add regression coverage, and the PR's motivating all-literal example should normally be handled by PreparedFunctionImpl's default constant-argument path before execute_impl reaches the new left_const branch. Please add a regression test that fails before this change and exercises the exact intended path, for example a multi-row query with a constant first argument and valid tokenizer properties, with deterministic ordered output or row-count validation through qt/order_qt output.
Critical checkpoint conclusions:
- Goal: intended to fix tokenize row count for const first argument; the code may cover a lower-level const/non-const execution path, but the claimed SQL example is not proven by tests.
- Scope: code change is small and localized.
- Concurrency/lifecycle/config/compatibility: not involved.
- Parallel paths: generic all-constant handling exists and should be considered; no FE/BE protocol changes.
- Tests: insufficient; no regression or unit test was added for the behavior change.
- Observability/transactions/data writes: not involved.
- Performance/memory: no material new hot-path concern from the small wrapper change.
User focus points: none were provided.
| _do_tokenize_none(*col_left, dest_column_ptr); | ||
| block.replace_by_position(result, std::move(dest_column_ptr)); | ||
| if (left_const) { | ||
| block.replace_by_position( |
There was a problem hiding this comment.
This behavior change needs regression coverage. The PR description uses SELECT tokenize('hello world', 'parser=english') FROM table_with_many_rows, but when both arguments are literals the generic PreparedFunctionImpl::default_implementation_for_constant_arguments should unwrap, execute one row, and wrap the result as a ColumnConst before this left_const branch is reached. Please add a regression test that fails before this patch and exercises the actual intended path, with multiple input rows and deterministic expected output/row count, so we can verify this branch fixes a real user-visible case rather than an untested lower-level edge case.
Adds regression coverage for the const-first-argument fix in
FunctionTokenize::execute_impl. The generic
PreparedFunctionImpl::default_implementation_for_constant_arguments
path only fires when all arguments are constant, so the original bug
example `SELECT tokenize('hello', '"parser"="english"') FROM t` never
actually reaches execute_impl. The new tests use a parser_config
column to force a non-const second argument, which is the path where
the left_const branch matters.
Both the analyzer path (parser=english) and the PARSER_NONE early
return are covered. Assertions check that row count equals
input_rows_count (was 1 before the fix) and that every row carries
the expected tokenized value.
|
/review |
|
run buildall |
There was a problem hiding this comment.
I found one blocking issue in the added regression coverage. The BE fix itself is small and matches the local const-column handling pattern, but the new test uses a query shape that FE rejects before BE execution, so it does not provide valid coverage and should fail the regression suite.
Critical checkpoint conclusions:
- Goal/test proof: the code aims to preserve row count when
tokenizereceives a const first argument afterunpack_if_const; the implementation addresses the BE row-count issue, but the added regression does not prove it because the second argument is not legal SQL fortokenize. - Scope/focus: the BE code change is small and focused; the test addition is larger than necessary and currently invalid.
- Concurrency/lifecycle/config/compatibility: no new concurrency, special lifecycle, config, storage format, or protocol compatibility concerns found.
- Parallel paths: both analyzer and
PARSER_NONEBE branches were updated consistently. - Error handling/memory/observability: no new unchecked
Status, ownership, or observability issue found in the BE change. - Test coverage/results: coverage remains blocking because the new regression queries should fail analysis instead of exercising
FunctionTokenize::execute_impl. - User focus: no additional user-provided review focus was specified.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
The earlier regression-test addition was invalid: FE `Tokenize.checkLegalityBeforeTypeCoercion()` requires the second argument to be a `StringLikeLiteral`, so the test query that read the parser config from a table column failed analysis instead of reaching `FunctionTokenize::execute_impl`. And when both arguments are literals `PreparedFunctionImpl::default_implementation_for_constant_arguments` already short-circuits with the all-const fast path, so there is no SQL shape that exercises the new `left_const` branch. Drop the invalid regression test and add a BE unit test that builds the block directly: arg0 is wrapped in a `ColumnConst(size=N)` whose inner data column has 1 row, arg1 is a regular `ColumnString` with N rows. With that shape `all_arguments_are_constant` is false, the generic fast path skips, `execute_impl` is reached, and `unpack_if_const(arg0)` returns `left_const == true`. The assertions check that the result column has N rows (was 1 before the fix) and that every row carries the same tokenized value, for both the analyzer branch and the `PARSER_NONE` early-return branch.
|
/review |
TPC-H: Total hot run time: 31259 ms |
There was a problem hiding this comment.
Automated review summary for head de5e1d7:
No additional distinct inline issues found in this pass. I read the existing review threads first and did not duplicate the previously raised concerns. The current diff only changes BE tokenize const-result handling and adds BE unit coverage; the previously changed Groovy regression file is no longer part of the current diff.
Critical checkpoint conclusions:
- Goal/test: The patch preserves row count when tokenize receives a const first argument after
unpack_if_const, and adds BE unit tests for both the analyzer path andparser='none'early-return path. - Scope: The implementation is small and focused on wrapping the single computed tokenized value as
ColumnConst(input_rows_count)only when the first argument was const. - Concurrency/lifecycle: No new shared state, locks, threads, static initialization, or lifecycle-sensitive ownership was introduced.
- Config/compatibility/protocol: No configuration, storage format, protocol, or FE/BE serialization compatibility changes were introduced.
- Parallel paths: Both normal tokenizer and
PARSER_NONEearly-return paths now handleleft_constconsistently. - Conditional checks: The added
left_constbranches follow the existing constant-column semantics and are narrow. - Test coverage/results: New BE unit tests cover the direct BE path. I did not run tests in this review environment.
- Observability: No new observability appears necessary for this local scalar-function behavior.
- Transaction/persistence/data write correctness: Not applicable; this is a read-only scalar function path.
- FE-BE variable passing: Not applicable; no new variables are transmitted.
- Performance: The change avoids materializing repeated equal string results and uses
ColumnConst, so there is no obvious new hot-path performance issue.
User focus: No additional user-provided review focus was specified, and no extra focus-specific issue was found.
|
run buildall |
TPC-DS: Total hot run time: 171922 ms |
TPC-H: Total hot run time: 31258 ms |
TPC-DS: Total hot run time: 172530 ms |
|
skip check_coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
…ument is const (#62699) ## Proposed changes Fix a bug in the `tokenize` function where `unpack_if_const` unwraps a `ColumnConst` to its inner data column (which has only 1 row), but `_do_tokenize` and `_do_tokenize_none` iterate based on the source column's row count. This causes only 1 output row to be produced instead of `input_rows_count` rows when the first argument is a constant. For example, `SELECT tokenize('hello world', 'parser=english') FROM table_with_many_rows` would previously return only 1 row instead of the expected number of rows matching the table. The fix wraps the result in `ColumnConst` when the source column was const, which is the standard pattern used throughout the Doris codebase for handling const columns in function execution. ## Further comments Related Jira: DORIS-25296 ## Checklist(Required) 1. Does it affect the results of the existing test cases (Yes/No): No 2. Does it need to update the document (Yes/No): No 3. Is there a risk of compatibility changes (Yes/No): No
Proposed changes
Fix a bug in the
tokenizefunction whereunpack_if_constunwraps aColumnConstto its inner data column (which has only 1 row), but_do_tokenizeand_do_tokenize_noneiterate based on the source column's row count. This causes only 1 output row to be produced instead ofinput_rows_countrows when the first argument is a constant.For example,
SELECT tokenize('hello world', 'parser=english') FROM table_with_many_rowswould previously return only 1 row instead of the expected number of rows matching the table.The fix wraps the result in
ColumnConstwhen the source column was const, which is the standard pattern used throughout the Doris codebase for handling const columns in function execution.Further comments
Related Jira: DORIS-25296
Checklist(Required)